ABSTRACT

As general purpose robots become more capable, pre-pro- 
gramming of all tasks at the factory will become less prac- 
tical. We would like for non-technical human owners to be 
able to communicate, through interaction with their robot, 
the details of a new task; we call this interaction "task com- 
munication". During task communication the robot must 
infer the details of the task from unstructured human sig- 
nals and it must choose actions that facilitate this inference. 
In this paper we propose the use of a partially observable 
Markov decision process (POMDP) for representing the task 
communication problem, with the unobservable task details
and unobservable intentions of the human teacher captured 
in the state, with all signals from the human represented as 
observations, and with the cost function chosen to penalize 
uncertainty. We work through an example representation of 
task communication as a POMDP, and present results from 
a user experiment on an interactive virtual robot, compared 
with a human controlled virtual robot, for a task involving 
a single object movement and binary approval input from 
the teacher. The results suggest that the proposed POMDP 
representation produces robots that are robust to teacher 
error, that can accurately infer task details, and that are 
perceived to be intelligent. 

1. INTRODUCTION 

General purpose robots such as Willow Garage's PR2 and 
Stanford's STAIR robot are capable of performing a wide 
range of tasks such as folding laundry [1] and unloading the
dishwasher. While many of these tasks will come
pre-programmed from the factory, we would also like the robots
to acquire new tasks from their human owners. For the gen- 
eral population, this demands a simple and robust method 
of communicating new tasks. Through this paper we hope
to promote the use of the partially observable Markov
decision process (POMDP) as a framework for controlling the
robot during these task communication phases. The idea 
is that we represent the unknown task as a set of hidden 
random variables. Then, if the robot is given appropriate 
models of the human, it can choose actions that elicit infor- 
mative responses from the human, allowing it to infer the 
value of these hidden random variables. We formalize this 
idea in the Framework section. This approach makes the 
robot an active participant in task communication.^1



Note that we distinguish "task communication" from "task 
execution". Once a task has been communicated it might 
then be associated with a trigger for later task execution. 
This paper deals with communicating the details of a task, 
not commanding the robot to execute a task; i.e. task com- 
munication not task execution. 

Many researchers have addressed the problem of task com- 
munication. A common approach is to control the robot 
during the teaching process, and demonstrate the desired 
task [4][5][6]. The problem for the robot is then to infer the 
task from examples. Our proposal addresses the more general
case in which the robot must actively participate in the
communication, choosing actions to facilitate task inference. 

In other work, as in ours, the task communication is more 
hands off, requiring the robot to choose actions during the 
communication, with much of the work using binary approval
feedback as in our experiment below [7][8][9]. Our
work differs in that we propose the use of a POMDP repre- 
sentation, while prior work has created custom representa- 
tions with ad hoc inference and action selection procedures. 

1 Though related, this is different from an active learning
problem [3], since the interaction in task communication is
less structured than in the supervised learning setting.

An exception is the Sophie's Kitchen work, which used the
widely accepted MDP representation [10]. Their approach
differs in many ways from ours, but most importantly in
the way actions are selected during task communication. In 
their work the robot repeatedly executes the task, with some 
noise, as best it currently knows it, while in our approach the 
robot chooses actions to become more certain about the task. 
Intuitively, if the goal of the interaction is to communicate a 
task as quickly as possible, then repeatedly executing the full 
task as you currently believe it, is likely not the best policy. 
Instead, the robot should be acting to reduce uncertainty 
specifically about the details of the task that it is unclear 
on. In order to generate these uncertainty reducing actions 
we feel that a representation allowing for hidden state is 
needed, and we propose the POMDP. In a POMDP there 
can be a distribution over details of the task, and actions can 
be generated to reduce the uncertainty in this distribution. 

Substantial work has also been done on human assisted 
learning of low level control policies, such as the mountain 
car experiments, where the car must learn a throttle policy 
for getting out of a ravine [11]. While the mode of input
is the same as is used in our experiment (a simple reward- 
ing input signal), we are addressing different problems and 
different solutions are appropriate. They are addressing the 
problem of transferring a control policy from a human to 
the robot, where explicit conversation actions to reduce
uncertainty would be inefficient and treating the human
input as part of an environmental reward is appropriate. In
contrast we are addressing the problem of communicating
higher level tasks, such as setting the table, in which case 
explicitly modeling the human and taking communication 
actions to reduce uncertainty is beneficial, and treating the 
human input as observations carrying information about the 
task details is appropriate. 

In the next section we present our approach of represent- 
ing task communication as a POMDP, and in the following 
section we describe a user experiment that demonstrates
our approach and highlights its advantages.



2. FRAMEWORK 

2.1 Definitions 

Random variables will be designated with a capital letter 
or a capitalized word; e.g. X. The values of the random vari- 
able will be written in lowercase; e.g. X = x. If the value of 
a random variable can be directly observed then we will call 
it an "observable" random variable. If the value of a random 
variable cannot be directly observed then we will call it a 
"hidden" random variable, also known as a "latent" random 
variable. If the random variable is "sequential", meaning it 
changes with time, then we provide a subscript to refer to 
the time index; e.g. M_t below. A random variable can be
multidimensional; e.g. the state S below is made up of other
random variables: Mov, M_t, etc. If a set of random variables
contains at least one hidden random variable then we
will call it "partially observable". 

P(X) is the probability distribution defined over the domain
of the random variable X. P(x) is the value of this
distribution for the assignment of x to X. P(X|Y) is a
"conditional" probability distribution, and defines the
probability of a value of X given a value of Y. A "marginal"
probability distribution is a probability distribution that
results from summing or integrating out other random
variables; e.g.
$P(X, Y) : \forall x \in X, y \in Y,\; P(x, y) = \sum_{z \in Z} P(x, y, z)$.
When a probability distribution has a specific name, such
as the "transition model" or the "observation model", we will
use associated letters for the probability distribution; e.g.
$T(\ldots)$ or $\Omega(\ldots)$.
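
As an illustration of the marginalization defined above, the following minimal Python sketch sums a joint distribution over the variables that are not kept. The dictionary layout, the function name `marginalize`, and the uniform joint over three binary variables are assumptions made only for this example; they are not part of the experiment.

```python
from itertools import product


def marginalize(joint, keep):
    """Sum a joint distribution over every variable position not in `keep`.

    joint: dict mapping a tuple of values, e.g. (x, y, z), to a probability.
    keep:  tuple of positions to retain, e.g. (0, 1) keeps x and y.
    """
    marginal = {}
    for assignment, prob in joint.items():
        key = tuple(assignment[i] for i in keep)
        marginal[key] = marginal.get(key, 0.0) + prob
    return marginal


# Hypothetical joint P(X, Y, Z): uniform over three binary variables.
joint_xyz = {xyz: 1.0 / 8.0 for xyz in product((0, 1), repeat=3)}

# P(x, y) = sum over z of P(x, y, z)
p_xy = marginalize(joint_xyz, keep=(0, 1))
```

The same helper could be applied to a belief over the full state to recover the marginal over the task variables alone, provided the state is stored as a tuple with the task variables at known positions.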

2.2 POMDP Review 

A POMDP representation is applicable to action selection
problems where the environment is sequential and partially
observable. A POMDP is a 7-element tuple:
$(S, A, O, T, \Omega, C, \gamma)$. $S$ is the set of possible world states; $A$ is
the set of possible actions; $O$ is the set of possible
observations that the agent may measure; $T$ is the transition
model defining the stochastic evolution of the world (the
probability of reaching state $s' \in S$ given that the action
$a \in A$ was taken in state $s \in S$); $\Omega$ is the observation
model that gives the probability of measuring an observation
$o \in O$ given that the world is in a state $s \in S$; $C$ is a
cost function which evaluates the penalty of a state $s \in S$
or the penalty of a probability distribution over states $b$,
$b : \forall s \in S\, [0 \le b(s) \le 1 \text{ and } \sum_{s \in S} b(s) = 1]$; and, finally, $\gamma$ is
the discount rate for the cost function.

Given a POMDP representation, a POMDP solver seeks a
policy $\pi$, $\pi(b) : b \to a$, that minimizes the sum of discounted
expected cost.^2 The cost is given by $C$, the discounting is
given by $\gamma$, and the expectation is computed using $T$ and $\Omega$.

For further reading on POMDPs see [12], [13], and [14].
For an introduction to POMDP solvers, with further references,
see [13] chapter 15.

POMDPs have been successfully applied to problems as 
varied as autonomous helicopter flight [15], mobile robot 
navigation [16], and action selection in nursing robots [17] . 
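
To make the roles of the transition model and the observation model in the review above concrete, the following is a minimal sketch of the standard discrete belief update: predict with T, weight by the observation likelihood, then normalize. The function names, the callable signatures, and the two-state example are hypothetical and are not taken from the experiment.

```python
def belief_update(belief, action, observation, transition, observe):
    """One Bayes-filter step over a discrete state set.

    belief:     dict state -> probability (must list every state).
    transition: callable (s, a, s_next) -> P(s_next | s, a).
    observe:    callable (o, s_next) -> P(o | s_next).
    """
    unnormalized = {}
    for s_next in belief:
        predicted = sum(transition(s, action, s_next) * p
                        for s, p in belief.items())
        unnormalized[s_next] = observe(observation, s_next) * predicted
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical two-state example with a static world.
def static_transition(s, a, s_next):
    return 1.0 if s == s_next else 0.0


def noisy_observe(o, s):
    p_yes = 0.9 if s == "A" else 0.2
    return p_yes if o == "yes" else 1.0 - p_yes


prior = {"A": 0.5, "B": 0.5}
posterior = belief_update(prior, "noa", "yes", static_transition, noisy_observe)
```

In this toy case the posterior probability of state A is 0.45 / (0.45 + 0.10), so a single "yes" observation moves the belief from 0.5 to roughly 0.82.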

2.3 Approach 

We propose the use of a POMDP for representing the
problem of a human communicating a task to a robot.
Specifically, for the elements of the POMDP tuple
$(S, A, O, T, \Omega, C, \gamma)$, the partially observable state $S$
captures the details of the task along with, potentially, the
mental state of the human (helpful for interpreting ambiguous
observations); the set of actions $A$ captures all possible
actions of the robot during communication (for example words,
gestures, body positions, etc.); the set of observations $O$
captures all signals from the human (be they words, gestures,
buttons, etc.); and the cost function $C$ should encode the
desire to minimize uncertainty over the task details in $S$.
The transition model $T$, the observation model $\Omega$, and the
discount rate $\gamma$ fill their usual POMDP roles. The experiment
below provides an example of $T$ and $\Omega$ for task communication.

3. EXPERIMENT 
3.1 Simulator 

The experiment was run using the simulator shown in
Figure 1. The virtual world consists of 3 balls and the robot,
displayed as circles and a square (1a). The robot can
"gesture" at balls by lighting them (1b), it can pick up a ball
(1d), it can slide one ball to one of four distances from
another ball (1e), and it can signal that it knows the task by
displaying "Final" (1f). For the comparison trials, in which
a human is controlling the robot, these actions are controlled
by left and right mouse clicks on the objects involved in the
action; e.g. right click on a ball to light it or turn off the
light, left click on a ball to pick it up, etc. In this interface
the teacher input is highly constrained; the teacher has only
one control, the keyboard spacebar, which is visually
indicated by a one half second rectangle (1c) and is used to
indicate approval of something that the robot is doing or
has done. The simulator is networked so that two views can
be opened at once; this is important for the comparison
trials, where the human controlling the robot must be hidden
from view.

Timesteps are 0.5 seconds long, i.e. the robot receives an
observation and must generate an action every 0.5 seconds.
The simulator is free running, so, in the comparison trials
where the human controls the robot, if the human does not
select an action, then the "no action" action is taken. "No
action" actions are still taken in the case of the POMDP
controlled robot, but they are always intentional actions that
have been selected by the robot. We provide enough
processing power so that the POMDP controlled robot always
has an action ready in the allotted 0.5 seconds.

2 Often a reward function is used instead of a cost function,
but these are interchangeable; minimizing the cost function
C is the same as maximizing the reward function -C.
3 We will use the formulation of C as a cost function over
a probability distribution, since in communication we care
that the communication is successful, not what was
communicated.

Figure 1: Typical human-robot interactions on the
simulator. (a) The state all trials start in; the square
robot, holding no balls, surrounded by the three
balls. (b) The robot lighting one of the balls, as if to
ask, "move this?". (c) The human teacher pressing
the keyboard spacebar, displayed in the simulator
as a rectangle, and used to indicate approval with
something the robot has done. (d) The robot holding
one of the balls; the four relative distances are
displayed to the remaining two balls. (e) The robot
has slid one of the balls to the furthest distance from
another ball. (f) The robot has displayed "Final",
indicating that it knows the task, and that the world
is currently in a state consistent with that task.

The simulated world is discrete, observable, and determin- 
istic. 

3.2 Toy Problem 

For this experiment, the human teacher was asked to com- 
municate, through only spacebar presses, that a specific ball 
should be at a specific distance from another specific ball. 
The teacher was told that a spacebar press would be inter- 
preted by the robot as approval of something that it was 
doing. At the beginning of each trial the teacher was shown 
a card designating the ball relationship to teach. The robot, 
either POMDP or human controlled, had to infer the re- 
lationship from spacebar presses (elicited from its actions). 
When the robot thought it knew the relationship, it would 
end the trial by moving the world to that relationship and 
displaying "Final". The teacher would then indicate on pa- 
per whether the robot was correct and how intelligent they 
felt the robot in that trial was. 

Figure 1 shows snapshots from one typical trial. The
robot "questioned" which ball to move (1b), the teacher
indicated approval (1c), the robot picked up the ball (1d),
the robot generally "questioned" which ball to move the one
it was holding to (not shown), the robot slid the ball
toward another ball (1e), the teacher approved a distance or
progress toward a distance (not shown), and, after further
exploration, the robot indicated that it knew the task by
displaying "Final" (1f).

Although this problem is simplistic, a robot whose behav- 
iors consist of chainings of these simple two object relation- 
ship tasks could be useful; e.g. for the "set the table" task: 
move the plate to zero inches from the placemat, move the 
fork to one inch from the plate, move the spoon to one inch 
from the fork, etc. 

We chose the spacebar press as the input signal in our ex- 
periment because it carries very little information, requiring 
the robot to infer meaning from context, which is a strength 
of our approach. For a production robot, this constrained 
interface should likely be relaxed to include signals such as 
speech, gestures, or body language. These other signals are 
also ambiguous, but the simplicity of a spacebar press made 
the uncertainty obvious for our demonstration. 

3.3 Formulation 

3.3.1 State (S) 

The state S is composed of hidden and observable random 
variables. 

The task that the human wishes to communicate is cap- 
tured in three hidden random variables Mov, WRT, and 
Dist. Mov is the index of the ball to move (1-3). WRT is
the index of the ball to move ball Mov with respect to. Dist 
is the distance that ball Mov should be from ball WRT. 

The state also includes a sequential hidden random variable,
M_t, for interpreting the observations O_t. M_t takes on
one of five values: waiting, mistake, that_mov, that_wrt,
or that_dist. A value of waiting implies that the human is
waiting for some reason to press the spacebar. A value of
mistake implies that the human accidentally pressed the
spacebar. A value of that_mov implies that the human
pressed the spacebar to indicate approval of the ball to
move. A value of that_wrt implies that the human pressed
the spacebar to indicate approval of the ball to move ball
Mov with respect to. A value of that_dist implies that the
human pressed the spacebar to indicate approval of the
distance that ball Mov should be from ball WRT.

The state also includes observable random variables for 
the physical state of the world; e.g. which ball is lit, which 
ball is being held, etc. 

Finally, the state includes "memory" random variables for 
capturing historical information, e.g. the last time step that
each of the balls was lit, or the last time step that M =
that_mov. The historical information is important for the
transition model T. For example, humans typically wait one 
to five seconds before pressing the spacebar a second time. 
In order to model this accurately we need the time step of 
the last spacebar press. 
3.3.2 Actions (A)

There are six parameterized actions that the robot may
perform. Certain actions may be invalid depending on the
state of the world. The actions are: noa, for performing
no action and leaving the world in the current state;
light_on(index) or light_off(index), for turning the light
on or off for the ball indicated by index; pick_up(index), for
picking up the ball indicated by index; release(index), for
putting down the ball indicated by index; and slide(index_1,
distance, index_2), for sliding the ball indicated by index_1
to the distance indicated by distance relative to the ball
indicated by index_2. Note that only a few actions are valid
in any world state; for example, slide(index_1, distance,
index_2) is only valid if ball index_1 is currently held or
currently at a distance from ball index_2 and if distance is
only one step away from the current distance.
3.3.3 Observations (O)

An observation takes place at each time step and there 
are two valid observations: spacebar or no_spacebar,
corresponding to whether the human pressed the spacebar on
that time step. 






3.3.4 Transition Model (T)

A transition model gives the probability of reaching a
new state, given an old state and an action. In this example,
Mov, WRT, and Dist are non-sequential random variables,
meaning they do not change with time, so
T(Mov = i, ... | Mov = i, ...) = 1.0. The transition model for
the physical state of the virtual world is also trivial, since
the virtual world is deterministic.

The variable of interest in this example for the transition
model is the sequential random variable M that captures
the mental state of the human (waiting, mistake, that_mov,
that_wrt, or that_dist). The transition model was specified
from intuition, but in practice we envision that it would
either be specified by psychological experts, or learned from
human-human or human-robot observations. For the experiment
we set the probability that M transitions to mistake
from any state to a fixed value of 0.005, meaning that at
any time there is a 0.5% chance that the human will
mistakenly press the spacebar indicating approval. We define
the probability that M transitions to that_mov, that_wrt,
or that_dist as a table-top function, as shown in Figure 2.
We set the probability that M transitions to waiting to
the remaining probability: T(M = waiting) = 1 - T(M =
mistake ∨ that_mov ∨ that_wrt ∨ that_dist).
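
The transition probabilities for M described above can be sketched in a few lines of Python. The peak of 0.1, the 2 second ramps, the 0.5 second time step, and the 0.005 mistake probability come from the text and the Figure 2 caption; the piecewise-linear shape, the function names, and the argument conventions are assumptions made for illustration, and the reset after an approval (Figure 2c) is omitted.

```python
PEAK = 0.1          # peak per-step approval probability (Figure 2)
RAMP_STEPS = 4      # 2 seconds at 0.5 seconds per time step
P_MISTAKE = 0.005   # fixed probability of an accidental spacebar press


def tabletop(steps_lit, steps_since_off=None):
    """Approval probability for one hypothesis (assumed piecewise linear).

    steps_lit:       time steps the ball was lit (frozen once the light is off).
    steps_since_off: time steps since the light went off, None if still lit.
    """
    rise = PEAK * min(steps_lit, RAMP_STEPS) / RAMP_STEPS
    if steps_since_off is None:
        return rise
    decay = max(0.0, 1.0 - steps_since_off / RAMP_STEPS)
    return rise * decay


def mental_state_distribution(p_mov, p_wrt, p_dist):
    """Distribution over the next value of M; waiting takes the remainder."""
    probs = {
        "mistake": P_MISTAKE,
        "that_mov": p_mov,
        "that_wrt": p_wrt,
        "that_dist": p_dist,
    }
    probs["waiting"] = 1.0 - sum(probs.values())
    return probs


# Ball i has been lit for 2 seconds under the hypothesis Mov = i.
example = mental_state_distribution(tabletop(4), 0.0, 0.0)
```

Under these assumptions the human is still most likely to be waiting on any single time step (probability 0.895 in the example), which is why several time steps of lighting are typically needed to elicit an approval.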

3.3.5 Observation Model ($\Omega$)

The observation model we have chosen for this problem is
a many-to-one deterministic mapping:

$$
P(O = \text{spacebar} \mid M) =
\begin{cases}
0.0 & \text{if } M = \text{waiting} \\
1.0 & \text{otherwise}
\end{cases}
$$

Note that this deterministic model does not imply that
the state is observable since, given O = spacebar, we do not
know why the human pressed the spacebar: M = (mistake ∨
that_mov ∨ that_wrt ∨ that_dist).^4

Figure 2: This is an illustration of part of the
transition model T. Here we show the probability
that the human will signal their approval (via
a spacebar press) of the ball to be moved, T(M =
that_mov | Mov = i, ...), where, in this hypothesis, the
ball to be moved is ball i. (a) Once the robot has lit
ball i, the probability increases from zero to a peak
of 0.1 over 2 seconds. (b) After the light has turned
off there is still probability of an approving spacebar
press, but decreasing over 2 seconds. (c) If the
teacher has signaled their approval (M = that_mov),
then the probability resets. The structure and shape
of these models was set from intuition.

3.3.6 Cost Function (C)

As mentioned earlier, the cost function should be chosen
to motivate the robot to quickly and accurately infer what
the human is trying to communicate. In our case this is
a task captured in the random variables Mov, WRT, and
Dist. The cost function we have chosen is the entropy of
the marginal distribution over Mov, WRT, and Dist:^5

$$C(p) = -\sum_{x} p(x) \cdot \log(p(x)). \tag{1}$$



Where p is the marginal probability distribution over Mov, 
WRT, and Dist, and x takes on all permutations of the 
value assignments to Mov, WRT, and Dist. 

Since entropy is a measure of the uncertainty in a proba- 
bility distribution, this cost function will motivate the robot 
to reduce its uncertainty over Mov, WRT, and Dist, which 
is what we want. 
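
The entropy cost of equation (1) is straightforward to compute from a marginal distribution. The sketch below assumes a natural logarithm (the base is not stated in the text) and uses a hypothetical uniform distribution over 24 task assignments purely as an example input; the function name `entropy_cost` is also an assumption.

```python
import math


def entropy_cost(marginal):
    """Entropy of a discrete distribution given as dict assignment -> prob.

    Terms with zero probability contribute nothing (the limit of p*log p).
    """
    return -sum(p * math.log(p) for p in marginal.values() if p > 0.0)


# Hypothetical marginal over (Mov, WRT, Dist): uniform over 24 assignments.
uniform = {i: 1.0 / 24.0 for i in range(24)}
certain = {0: 1.0}

cost_uniform = entropy_cost(uniform)   # maximal uncertainty, log(24)
cost_certain = entropy_cost(certain)   # no uncertainty, zero cost
```

A uniform belief gives the largest cost and a belief concentrated on one assignment gives zero cost, so minimizing this quantity pushes the robot toward actions whose expected observations discriminate between task hypotheses.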

3.3.7 Discount Rate ($\gamma$)

$\gamma$ is set to 1.0 in our experiment, meaning that uncertainty
later is just as bad as uncertainty now. The valid range of $\gamma$
for a POMDP solver evaluating actions to an infinite horizon
is $0 \le \gamma < 1.0$, but our solver only evaluates to a 2.5 second
horizon. In practice, the larger the value of $\gamma$, the more
willing the robot is to defer smaller gains now for larger
gains later.

3.4 Action Selection 

The problem of action selection is the problem of solving
the POMDP. There are many established techniques for
solving POMDPs [18]. In our case, given the simplicity of
the world and the problem, we could take a direct approach.
The robot expands the action-observation tree out 2.5 seconds
into the future, and takes the action that minimizes
the sum of expected entropy over this tree. This solution is
approximate, since we only look ahead 2.5 seconds, but, as
we will show shortly, it results in reasonable action selections
for the toy problem used in the experiment.

4 If there were noise in the spacebar key then this would not
be a deterministic mapping.

5 By "entropy" we mean information entropy, not thermodynamic
entropy.

When the marginal probability of one of the assignments 
to Mov, WRT, and Dist is greater than 0.98 (over 98% con- 
fident in that assignment), then the robot moves the world 
to that assignment and displays "Final". 
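
The finite-horizon search and the 0.98 stopping rule described above can be sketched as follows. This is a simplified illustration and not the solver used in the experiment: it assumes a static hidden task, folds the transition and observation models into a single hypothetical likelihood `obs_likelihood(o, s, a)`, and uses a toy three-hypothesis problem with made-up action names. A 2.5 second horizon at 0.5 seconds per step would correspond to a depth of 5; the example uses a depth of 3 to stay small.

```python
import math


def entropy(belief):
    return -sum(p * math.log(p) for p in belief.values() if p > 0.0)


def lookahead(belief, depth, actions, observations, obs_likelihood):
    """Return (expected summed entropy, best first action) for a depth-step tree."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = math.inf, None
    for a in actions:
        value = 0.0
        for o in observations:
            p_o = sum(obs_likelihood(o, s, a) * p for s, p in belief.items())
            if p_o == 0.0:
                continue  # this branch can never occur
            posterior = {s: obs_likelihood(o, s, a) * p / p_o
                         for s, p in belief.items()}
            future, _ = lookahead(posterior, depth - 1, actions,
                                  observations, obs_likelihood)
            value += p_o * (entropy(posterior) + future)
        if value < best_value:
            best_value, best_action = value, a
    return best_value, best_action


def confident_assignment(belief, threshold=0.98):
    """Return the assignment to finalize on, or None if still uncertain."""
    best = max(belief, key=belief.get)
    return best if belief[best] > threshold else None


# Toy problem (hypothetical): which of three balls is the one to move?
ACTIONS = ("noa", "light_1", "light_2", "light_3")
OBSERVATIONS = ("spacebar", "no_spacebar")


def obs_likelihood(o, s, a):
    p_press = 0.005            # accidental press
    if a == f"light_{s}":
        p_press += 0.1         # approval when the hypothesized ball is lit
    return p_press if o == "spacebar" else 1.0 - p_press


uniform = {1: 1.0 / 3.0, 2: 1.0 / 3.0, 3: 1.0 / 3.0}
value, action = lookahead(uniform, 3, ACTIONS, OBSERVATIONS, obs_likelihood)
```

In this toy setting "noa" yields no information, so the search prefers one of the lighting actions. The tree grows as (actions x observations)^depth, which is why such a direct expansion is only practical for small problems and short horizons.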

3.5 Subjects 

The experiment involved 26 participants, consisting of un- 
dergraduate and graduate students ranging in age from 18 
to 31 with a mean age of 22. Four of the participants were 
randomly selected for the "human robot" role, leaving 22 
participants for the "teacher" role. The participants rated 
their familiarity with artificial intelligence software and sys- 
tems on a scale from 1 to 7; the mean score was 3.4 with a 
standard deviation of 1.9. Participants were paid $10.00 for 
their time in the experiment. 

3.6 Results 

3.6.1 Is Robust to Mistakes 

The strength of using a probabilistic approach such as 
a POMDP is in its robustness to noise. In our experi- 
ment noise came in the form of mistaken spacebar presses. 
Figure 3 illustrates a typical mistaken spacebar press. In
this trial, at the three second mark, the human mistakenly 
pressed the spacebar while ball 1 was lit, when in fact ball 
1 was not involved in the task. As expected, the robot's 
marginal probability that ball 1 was the ball to move imme- 
diately spiked. Yet there was still a small probability that 
the random variable M equaled mistake at the three second 
mark. The trial proceeded with the robot making use of the 
strong belief that ball 1 was the ball to be moved: it picked 
up ball 1 at 4 seconds and lit ball 2 and ball 3. As time 
progressed, and the robot did not receive further spacebar 
presses that would be consistent with a task involving ball 
1, the probability that the human mistakenly pressed the 
spacebar increased and the probability that ball 1 was the 
ball to move decreased. At thirty six seconds, the belief that 
a mistake occurred was strong enough that the action which 
minimized the expected entropy was to put down ball 1 and 
continue seeking another ball to move. 

3.6.2 Can Infer Hidden State 

The second result from the experiment is that the robot
accurately inferred the hidden task and the hidden state of
the teacher. In all trials the human teachers reported that
the robot was correct about the task being communicated.
Figure 4 shows a look at the robot's marginal probabilities,
for one of the trials, of the random variables Mov, WRT,
and Dist. In this trial, as was typical of the trials, the
robot first grew its certainty about Mov followed by WRT
and then Dist. Figure 5 shows the probability of the true
assignment to M at the time of the spacebar press and at
the end of the trial, for four assignments to the variable
M.^6 This shows that as each trial progressed the robot
became correctly certain about what the human meant by
each spacebar press.

6 We did not include before and after for M = waiting
because the observation model $\Omega$ makes this assignment
deterministic.

Figure 3: This figure shows a typical recovery from a
mistaken spacebar press. In this figure the teacher
mistakenly pressed the spacebar at the three second
mark while the robot was lighting ball 1. The
probability that ball 1 was the ball to be moved
immediately spiked. At the same time there was a low
probability that the spacebar press was a mistake.
At 4 seconds the robot picked up ball 1 and started
moving it, exploring tasks involving the movement
of ball 1. As the trial progressed without further
spacebar presses, the probability that the spacebar
press at 3 seconds was a mistake increased and the
probability that ball 1 was the ball to move
decreased. Finally, at 36 seconds the approximately
optimal policy was to put down ball 1 and reassess
which object was to be moved.

Figure 4: This figure shows the robot's inference of
the ball to move (a), the ball to move it with respect
to (b), and the distance between the balls (c) for
one of the trials. The vertical lines designate spacebar
presses. The solid line in each figure shows the
marginal probability of the true assignment for that
random variable. The marginal probabilities for the
true assignments are driven to near 1.0 by information
gathered from the spacebar presses elicited by
the robot's actions.

3.6.3 Can Reasonably Select Actions

We captured two metrics in an effort to evaluate the quality
of the POMDP selected actions.

The first was a subjective rating of the robot's intelligence
by the teacher after each trial. Figure 6 shows the ratings for
the human controlled robot and the ratings for the POMDP
controlled robot. The human received higher intelligence
ratings, but not significantly; we believe that this gap can
be narrowed with better modeling (see Section 4.1).

The second metric we looked at was the time to communicate
the task. This was measured as the time until
the robot displayed "Final". Figure 7 is a histogram of the
time until the robot displayed "Final" for the POMDP robot
and for the human controlled robot. Here again the human
controlled robot outperformed the POMDP controlled
robot, but the POMDP controlled robot performed reasonably
well. Part of this discrepancy could be due to inaccurate
models, as in the intelligence ratings, but in this case we
believe that the threshold for displaying "Final" was higher
for the POMDP robot (over 98% confident) than for the
human. Notably, we often observed the human controlled
robot displaying "Final" after a single spacebar press at the
final location. In contrast, the POMDP robot always explored
other distances, presumably to rule out the possibility
that the first spacebar press was a mistake. Only after
a second spacebar press would the POMDP robot display
"Final".

Of interest as well is the POMDP robot's ability to drive 
down the cost function over each trial. Figure 8 plots the
cost function (entropy) as a function of time for each of the 
trials. During several trials the entropy increased signifi- 
cantly before dropping again. This corresponds to the trials 
in which the teacher mistakenly pressed the spacebar; the 
POMDP robot initially believed that there was information 
in the key press, but over time realized that it was a mistake 
and carried no information. The figure shows the reduction 
of entropy in all trials to near zero. 

4. FUTURE WORK 

4.1 Learning Model Structure and Model Parameters

In our experiment the structure and parameters of $T$ and
$\Omega$ were set from intuition. We believe that both the structure
and the parameters of the models can be "learned", in the
machine learning sense. The models could be learned either
from observations of humans communicating with other
humans or from observations of humans communicating with
robots. In work presently under development, we are fitting
the parameters of our model to logs of human teachers
teaching human controlled robots. We intend to continue along
this line, as it may be unrealistic to expect social scientists
to accurately and exhaustively model human teachers.

Figure 5: This figure shows the marginal probability
of the four approval mental states at the time the
spacebar was pressed (dark gray) and at the end of
the trial (light gray). The true states were labeled in
a post processing step. All spacebar presses from all
22 POMDP trials are included. This shows that, for
each of the mental states, the marginal probability
of the correct state increases as the trial progresses
and ends at near certainty. This is most pronounced
in the case of M = mistake, in which the initial
probability that the spacebar was a mistake is low, but
increases dramatically as the trial progresses.

4.2 Complex Tasks 

In our experiment the task communicated consisted of a 
single object movement. Future work will aim to communi- 
cate chains of object movements. 

4.3 Complex Signals 

In our experiment we limited observations to the space- 
bar key presses. As we move to tasks involving object move- 
ments in the real world we plan to incorporate further obser- 
vations such as the gaze direction of the teacher and pointing 
gestures from the teacher, perhaps using a laser pointer [19].
Note that social behavior such as shared attention, argued
for in [20], where the robot looks at the human to see where
they are looking, would naturally emerge once the teacher's 
gaze direction, along with the appropriate models, is added 
as an observation to the system; knowing what object the 
human is looking at is informative (reduces entropy), so ac- 
tions leading to the observation of the gaze direction would 
have low expected entropy and would likely be chosen. 

4.4 Processing 

Substantial progress has been made towards efficient
solutions of POMDPs, yet processing remains a significant
problem for POMDPs with complex domains. In future
work we hope to improve on these techniques by leveraging
parallel processing resources such as server clusters
(e.g. Amazon's Elastic Compute Cloud (EC2)), or graphics
cards enabled for general purpose computing (GPGPUs).

Figure 6: The 22 human teachers each participated in two
trials, one teaching the human controlled robot and one
teaching the POMDP controlled robot. The order of the
human and POMDP robots was randomized, and the true
identity of the robot controller was hidden from the
teacher. Following each trial the teacher rated the
intelligence of the robot on a scale from 1 to 10, with 10
being the most intelligent. With the exception of one
teacher, all teachers rated the human controlled robot as
intelligent as or more intelligent than the POMDP
controlled robot (mean of 9.30 vs. 8.26).

Figure 7: A histogram of the times until the robot,
human or POMDP controlled, displayed "Final".
The robot displayed "Final" to signal that it
knew the task and that the world was displaying
the task. The POMDP controlled robot displayed
"Final" when the marginal probability for a particular
task, P(Mov = i, WRT = j, Dist = k), was
greater than 0.98. In all trials the robot, human
or POMDP controlled, correctly inferred the task.
Task communication, as expected, took longer for
the POMDP controlled robot than for the human
controlled robot.
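
The "Final" rule in the Figure 7 caption can be sketched as follows. This is a minimal illustration, not the experiment's implementation: the layout of the joint belief as a dictionary keyed by (Mov, WRT, Dist, other), the example probabilities, and the function names are assumptions. Only the 0.98 threshold comes from the caption.

```python
from collections import defaultdict

FINAL_THRESHOLD = 0.98  # value stated in the Figure 7 caption

def task_marginal(joint_belief):
    """Sum out non-task variables; keys are (mov, wrt, dist, other)."""
    marginal = defaultdict(float)
    for (mov, wrt, dist, _other), p in joint_belief.items():
        marginal[(mov, wrt, dist)] += p
    return dict(marginal)

def final_task(joint_belief, threshold=FINAL_THRESHOLD):
    """Return the task whose marginal exceeds the threshold, else None."""
    marginal = task_marginal(joint_belief)
    task, p = max(marginal.items(), key=lambda kv: kv[1])
    return task if p > threshold else None

# Hypothetical beliefs; the fourth key element stands in for any
# remaining hidden variable, such as the teacher's intention.
confident = {
    (0, 1, 2, "approve"): 0.60,
    (0, 1, 2, "mistake"): 0.39,
    (1, 0, 2, "approve"): 0.01,
}
uncertain = {
    (0, 1, 2, "approve"): 0.70,
    (1, 0, 2, "approve"): 0.30,
}

confident_result = final_task(confident)  # marginal mass on (0, 1, 2) passes the threshold
uncertain_result = final_task(uncertain)  # no task passes the threshold
```

In the first belief the mass on task (0, 1, 2) is split across two values of the remaining variable, so the rule triggers only after marginalizing.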



4.5 Smooth Task Communication and Task Execution Transitions

This paper focused on task communication, but a robot 
will also spend time executing communicated tasks. We 
would like for the formulation to apply to the entire
operation of the robot, with optimal transitions between task
communication and task execution. A choice of a broader
cost function will be an important first step. One choice for 
this cost function might be the cost to the human under the 
human's cost function. The human's cost function would 
be captured in random variables, perhaps through a non- 
parametric model. The POMDP solver could then choose 
actions which would inform it about the human's cost func- 
tion, which would aid in minimizing the cost to the human. 
Note that task communication would still occur under this
cost function; for example, the robot might infer that doing
a task is painful to the human, and communication would
allow the robot to do this task for the human. Performing
communication actions would therefore be attractive (see
footnote 7).

4.6 IPOMDP 

In a classical POMDP the world is modeled as stochastic, 
but not actively rational; e.g. days transition from sunny 
to cloudy with a certain probability, but not as the result 
of the actions of an intelligent agent. In a POMDP the 
agent is the only intelligence in the world. An Interactive 
POMDP (IPOMDP) is one of several approaches that ex- 
tend the POMDP to multiple intelligent agents [22]. It dif- 
fers from game theoretic approaches in that it takes the per- 
spective of an individual agent, rather than analyzing all 
agents globally; the individual agent knows that there are 

7 Research has shown that inferring another agent's cost
function is possible (see inverse reinforcement learning) [21].



other intelligent agents in the world acting to minimize some 
cost function, but the actions of those agents and their cost 
functions may be only partially observable. We feel that the 
task communication problem falls into this category. The 
human teacher has objectives and reasons for communicating
the task; knowing those reasons could allow the robot to
better serve the human. Understanding the human and their
objectives is important to the smooth communication and
execution transitions described in Section 4.5. Thus future work
will extend our framework from the POMDP representation 
to the IPOMDP representation. 

Unfortunately, an IPOMDP adds exponential branching 
of inter-agent beliefs to the already exponential branching 
of probability space and action-observations in a POMDP. 
Thus, while it is a more accurate representation, it makes
a hard problem even harder. That said, an IPOMDP may
serve as a good formulation for which we then seek
approximate solutions.

5. SUMMARY 

In this paper we proposed the use of a POMDP for repre- 
senting the human-robot task communication problem, and 
we demonstrated the approach through a user experiment. 
The experiment suggested that this representation results 
in robots that are robust to teacher error, that can accu- 
rately infer task details, and that are perceived to be in- 
telligent. Future work will include more complex tasks and 
signals, better modeling of the teacher, and improved pro- 
cessing techniques. 